{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8a4125b6-79c0-4cba-8654-a5e4c40bc321",
   "metadata": {},
   "source": [
    "# How to implement vector-search using `superduperdb` on Snowflake\n",
    "\n",
    "In this use-case we describe how to implement vector-search using `superduperdb` on Snowflake. \n",
    "\n",
    "## Configure `superduperdb` to work with Snowflake\n",
    "\n",
    "The first step in doing this is \n",
    "to connect to your snowflake account. When you log in it should look something like this:\n",
    "\n",
    "![](/img/snowflake-login.png)\n",
    "\n",
    "The important thing you need to get from this login is the **organization-id** and **user-id** from the menu in the bottom right (annotated on the image). You will set these values in the cell below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "908e6da9-1486-4a8f-a1f6-36f9e8c6f2e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# We set this value, since Snowflake via `ibis` doesn't support `bytes` directly.\n",
    "os.environ['SUPERDUPERDB_BYTES_ENCODING'] = 'Str'\n",
    "\n",
    "from superduperdb import superduper, CFG\n",
    "\n",
    "user = \"<USERNAME>\"\n",
    "password = \"<PASSWORD>\"\n",
    "account = \"WSWZPKW-LN66790\"  # ORGANIZATIONID-USERID\n",
    "\n",
    "def make_uri(database):\n",
    "    return f\"snowflake://{user}:{password}@{account}/{database}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c794c86e-46f6-4d0a-b43b-518c3bbe28ca",
   "metadata": {},
   "source": [
    "## Set up sample data to test vector-search\n",
    "\n",
    "We're going to use some of the Snowflake sample data in this example, namely the `FREE_COMPANY_DATASET`. You \n",
    "can find the `FREE_COMPANY_DATASET` on [this link](https://app.snowflake.com/marketplace/listing/GZSTZRRVYL2/people-data-labs-free-company-dataset).\n",
    "\n",
    "Since the database where this data is hosted is read-only, we copy a sample of the data to our own dataset, and work with the data there."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05569589-ac6b-44f9-a1d0-0e1f2028f575",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.ibis.query import RawSQL\n",
    "\n",
    "db = superduper(\n",
    "    make_uri(\"FREE_COMPANY_DATASET/PUBLIC\"),\n",
    "    metadata_store='sqlite:///.testdb.db'\n",
    ")\n",
    "\n",
    "sample = db.execute(RawSQL('SELECT * FROM FREECOMPANYDATASET SAMPLE (5000 ROWS);')).as_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "019195c7-be05-4f01-a394-589e8a67acb9",
   "metadata": {},
   "source": [
    "## Connect to your dedicated vector-search database\n",
    "\n",
    "We use the connection we created to get the snapshot, to also create the databset we are going to work with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2359b71-004b-4bf5-8ac4-7a4b349a74d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.databackend.conn.create_database('SUPERDUPERDB_EXAMPLE', force=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "efbbeca2-6d8e-45bb-ae7c-0f1272c98033",
   "metadata": {},
   "source": [
    "Now we are ready to connect to this database with `superduperdb`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2e46108-8333-4f2c-8a17-1048db369bae",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.ibis.query import RawSQL\n",
    "\n",
    "db = superduper(\n",
    "    make_uri(\"SUPERDUPERDB_EXAMPLE/PUBLIC\"),\n",
    "    metadata_store='sqlite:///.testdb.db'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adc2c8f7-58cd-4c36-b191-79f684a3999e",
   "metadata": {},
   "source": [
    "Since `superduperdb` implements extra features on top of your classical database/ datalake, it's necessary\n",
    "to add the tables you wish to work with to the system. You'll notice we are creating a schema as well; that allows\n",
    "us to implement \"interesting\" data-types on top of Snowflake, such as images or audio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0868c228-7e81-4bc8-b5a6-85f7477a1ba4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.ibis.query import Table\n",
    "from superduperdb.backends.ibis.field_types import dtype\n",
    "from superduperdb import Schema\n",
    "\n",
    "_, t = db.add(\n",
    "    Table(\n",
    "        'MYTABLE',\n",
    "        primary_id='ID',\n",
    "        schema=Schema(\n",
    "            'MYSCHEMA',\n",
    "            fields={\n",
    "                k: dtype('str') \n",
    "                for k in sample.columns\n",
    "            }\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebc90a4c-7ea4-4e46-a45e-4eb47b4d9c67",
   "metadata": {},
   "source": [
    "If you log back into Snowflake now it should look like this:\n",
    "\n",
    "![](/img/snowflake-table.png)\n",
    "\n",
    "You'll see that the database and table have been created."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "423d5e2a-8a70-46e0-b8a0-b23291f4fd3c",
   "metadata": {},
   "source": [
    "## Insert data into the vector-search table\n",
    "\n",
    "Before inserting the data, we'll do a little bit of cleaning. The reason for this, is we want to have clean ids which uniquely define \n",
    "the rows which we are working with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d225c42-073d-4471-a78a-258ca06373f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "sample.ID = sample.ID.str.replace('[^A-Za-z0-9\\-]', '', regex=True).str.replace('[-]+', '-', regex=True)\n",
    "sample[sample.isnull()] = None\n",
    "del sample['FOUNDED']\n",
    "\n",
    "random_id = lambda: str(random.randint(100000, 999999))\n",
    "sample.ID = sample.ID.apply(lambda x: x + '-' + random_id())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc763e37-a1f9-459a-9493-2e7f20110770",
   "metadata": {},
   "outputs": [],
   "source": [
    "t = db.load('table', 'MYTABLE')\n",
    "t"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24b39ca7-5501-4abe-970c-de3f42a37f44",
   "metadata": {},
   "source": [
    "Now that we've created the table we want to work with, we can insert the sample data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6af18701-a0fc-4c06-8cd0-e3147e333cfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "db.execute(t.insert(sample))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30737a4b-8262-4fbb-bf1a-53133136e941",
   "metadata": {},
   "source": [
    "Let's check this was successful by fetching some data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "079ae06d-fece-4e37-a4f5-827af416735c",
   "metadata": {},
   "outputs": [],
   "source": [
    "list(db.execute(t.limit(5)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b58904af-7f37-4ea9-b446-7cb0639eff1f",
   "metadata": {},
   "source": [
    "In the next step, we're going to port a model from `sentence_transformers` to `superduperdb` and use this model in searching the rows \n",
    "of the table with vector-search. You can see that, in-addition to the `sentence_transformers` model, `superduperdb` allows\n",
    "developers to specify a preprocessing (and postprocessing) function in their `Model` instances. In this case, \n",
    "the `preprocess` argument tells the model how to convert with the individual lines of data (dictionaries) to strings, so that the model can understand these lines:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8257e154-d725-4516-9aa3-f75c332c8f79",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.ext.sentence_transformers import SentenceTransformer\n",
    "from superduperdb.ext.numpy import array\n",
    "\n",
    "model = SentenceTransformer(\n",
    "    identifier='all-MiniLM-L6-v2',\n",
    "    preprocess=lambda r: '; '.join([f'{k}={v}' for k, v in r.items()]),\n",
    "    encoder=array(dtype='float32', shape=(384,)),\n",
    "    predict_method='encode',\n",
    "    batch_predict=True,\n",
    "    device='mps',\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5071f494-3e24-4745-a763-8529c13ba086",
   "metadata": {},
   "source": [
    "This model is then sent to the `VectorIndex` component via `Listener` and registered with the system\n",
    "with `db.add`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d24dafa9-688d-408b-b247-a5fc1a3dee68",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Listener, VectorIndex\n",
    "\n",
    "db.add(\n",
    "    VectorIndex(\n",
    "        identifier='my-index',\n",
    "        indexing_listener=Listener(\n",
    "            select=t,\n",
    "            key='_base',\n",
    "            model=model,\n",
    "            predict_kwargs={'max_chunk_size': 500, 'batch_size': 30},\n",
    "        ),\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c73c1623-6278-435b-8169-63fb239cb157",
   "metadata": {},
   "source": [
    "This step will take a few moments (unless you have a GPU to hand).\n",
    "\n",
    ":::important\n",
    "**Once this step is finished you can \n",
    "search Snowflake with vector-search!**\n",
    ":::"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35db33b4-7374-4401-8977-bd8a73b5ae40",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "\n",
    "cur = db.execute(\n",
    "    t\n",
    "        .like(Document({'description': 'A management consultancy company based in the USA'}), vector_index='my-index', n = 3)\n",
    "        .limit(3)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5a62d21-9137-4b64-8276-b1a1a55a3cd2",
   "metadata": {},
   "source": [
    "We can view the results as a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "61b9e60b-050e-4ffc-ba08-be1b5a727cba",
   "metadata": {},
   "outputs": [],
   "source": [
    "cur.as_pandas()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
